[Bugfix]pass TP size to diffusion config#2867
Conversation
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
e76b07a to
f3f86b5
Compare
|
@codex review |
|
Codex Review: Didn't find any major issues. Keep it up! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
|
@NumberWan PTAL |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 596da6a029
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Please resolve conflict and recover the skip test in PR #2883 |
Signed-off-by: natureofnature <wzliu@connect.hku.hk>
Signed-off-by: natureofnature <wzliu@connect.hku.hk>
596da6a to
a41e8f0
Compare
|
|
||
| # This test uses the default Bagel YAML, and CLI does not control devices.We modify yaml file directly. | ||
| _BAGEL_DEFAULT_YAML = str( | ||
| Path(__file__).resolve().parents[3] / "vllm_omni" / "model_executor" / "stage_configs" / "bagel.yaml" |
There was a problem hiding this comment.
please use get_deploy_config_path()
There was a problem hiding this comment.
There's no bagel yaml in the deploy folder
|
Please resolve some comments from the bot |
Signed-off-by: natureofnature <wzliu@connect.hku.hk>
|
Please fix UT CI failure. Thanks |
Signed-off-by: natureofnature <wzliu@connect.hku.hk>
|
@hsliuustc0106 PTAL |
Fixed @Gaohan123 |
Sorry for my missing |
Signed-off-by: natureofnature <wzliu@connect.hku.hk>
Signed-off-by: natureofnature <wzliu@connect.hku.hk>
Signed-off-by: natureofnature <wzliu@connect.hku.hk>
f94cd0a to
53f4522
Compare
Signed-off-by: natureofnature <wzliu@connect.hku.hk> Signed-off-by: NumberWan <wantszkin2003@gmail.com>
Signed-off-by: natureofnature <wzliu@connect.hku.hk>
Signed-off-by: natureofnature <wzliu@connect.hku.hk>
Signed-off-by: natureofnature <wzliu@connect.hku.hk>
Signed-off-by: natureofnature <wzliu@connect.hku.hk> Signed-off-by: sphinxkkkbc <binchengkang8@gmail.com>
Signed-off-by: natureofnature <wzliu@connect.hku.hk>


PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.
Purpose
Reason:
Fix CLI --tensor-parallel-size being silently dropped by the diffusion engine, causing a shape mismatch error during KV cache transfer when running with TP > 1.
Root cause:
OmniDiffusionConfig.from_kwargs() filters kwargs to only include fields defined directly on OmniDiffusionConfig. Since tensor_parallel_size is a field of the nested DiffusionParallelConfig (not a top-level field), it was silently discarded. This meant the DiT stage always ran with TP=1 regardless of the CLI argument.
After PR #2705 introduced _inject_inferred_kv_tp_topology, the KV transfer manager correctly inferred a heterogeneous topology (e.g. from_tp=1, to_tp=2) based on the configured TP sizes. However, the DiT stage was actually running with TP=1 due to the parameter dropping, so it expected full KV heads (e.g. 4) while receiving sliced heads (e.g. 2), resulting in:
shape mismatch: value tensor of shape [47, 2, 128] cannot be broadcast to indexing result of shape [47, 4, 128]
Fix
In OmniDiffusionConfig.from_kwargs(), forward the top-level tensor_parallel_size into parallel_config before field filtering, so CLI arguments propagate correctly to the diffusion engine. If parallel_config already explicitly sets tensor_parallel_size (e.g. from YAML), the existing value is preserved.
Test Plan
In my local environment, I changed DIT devices to 1,2 because the default yaml settings will cause OOM on my GPU if stage 0 and 1 on the same device. The default settings sets the starting offset of DIT to GPU 0.
Test Result
@yenuo26 @princepride
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model. Please runmkdocs serveto sync the documentation editions to./docs.BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)